APLHA 99-03 



A New Technique for Sampling 
Multi-Modal Distributions 

K.J. Abraham 
Department of Physics and Astronomy 
Iowa State University 
Ames IA 50011 
e-mail abraham@iastate.edu 

L.M. Haines 
Department of Statistics and Biometry 
University of Natal 
Private Bag X01 
3209 Scottsville 
Pietermaritzburg 
South Africa 
e-mail haines@stat.unp.ac.za 



Abstract 

In this paper we demonstrate that multi-modal Probability Dis- 
tribution Functions (PDFs) may be efficiently sampled using an al- 
gorithm originally developed for numerical integration by monte-carlo 
methods. This algorithm can be used to generate an input PDF which 
can be used as an independence sampler in a Metropolis-Hastings 
chain to sample otherwise troublesome distributions. Some exam- 
ples in one, two, and five dimensions are worked out. We also com- 
ment on the possible application of our results to event generation in 
high energy physics simulations. (Subj. Classif. 68U20, 65C05, 81V25, 
81V15. Keywords Monte Carlo Optimisation, Metropolis-Hastings 
Chain, Vegas Algorithm, Independence Sampler). 



The key to solving a wide range of optimisation problems in science and 
engineering lies in being able to efficiently sample a (possibly very complex) 
PDF in one or more dimensions. In many cases of interest, this requires in- 
verting an integral which may not be possible by analytical or semi-analytical 
means. In such circumstances, efficient computer algorithms are crucial. The 
perhaps best known such algorithm is the Metropolis algorithm which 
can in principle be used to generate an accurate sample from any PDF no 
matter how complex, by a guided random walk. However, the Metropolis al- 
gorithm is potentially inefficent when confronted with a PDF with multiple 
modes, or peaks, especially if they are well seperated. As is well known, a 
very large number of random steps may needed to locate a new mode, once 
one mode has been discovered, leading to a dramatic drop in the efficiency of 
the scheme. In this paper we will show how this problem can be circumvented 
in a certain class of problems. 

In order to make the subsequent discussion more clear, we will present 
a brief analysis of the weakness of the Metropolis scheme outlined in the 
previous paragraph. Let Xj be some randomly choosen point in the space 
where the PDF of interest IT (not necessarily normalised), is to be sampled. 

A new point Xf at a distance 5 from Xi is choosen and the ratio is 

evaluated. If this ratio is larger than one, then the move Xi — > Xf is accepted. 
Otherwise it is accepted with probability • As can be imagined, locating 
a single peak of II can be easily accomplished. However, moving from one 
peak to another separated by a distance which is large compared with the 
stepsize 5 may require a long succession of steps "against the grain"; the 
net probability of such a sequence is sometimes so small that a prohibitively 
large number of trials may be needed in order to establish the existence of the 
second peak. This in a nutshell, is the reason for the potential inefficiency of 
the Metropolis algorithm alluded to earlier. 

One plausible remedy, varying 5 with each move has been incorporated 
into the Metropolis-Hastings algorithm H, where the sequence of steps is 
made on the basis of a proposal distribution. If the proposal distribution 
mimics n, then all the peaks of n may be found without difficulty. However, 
without prior knowledge of the separation between the peaks of n, it is diffi- 
cult to make a suitable choice for the proposal distribution. In other words, 
n must be mapped out globally in the region of interest before it has even 
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been studied. This requirement may appear to present an insurmountable 
obstacle to the use of the Metropolis-Hastings algorithm; the rest of this 
paper deals with methodology we have developed to deal with this problem. 

The key to our approach is the observation that the global structure of 
IT is required for another seemingly different problem, the evaluation of the 
definite integral of II over the region of interest. One technique for doing 
do which is easily adapted to integrands of higher dimensions is adaptive 
Monte Carlo simulation. A number of points are thrown at random along 
the boundaries of the region of interest (defining a grid) and the function is 
evaluated at these points. This process is repeated, however the second time 
around the grid from the first iteration is refined so that it is finer in regions 
where the function is larger and coarser where the function is smaller. On 
the third iteration, the grid previously obtained is further refined, and so 
on. After a suitable number of iterations a reliable estimate of the integral 
may be obtained, for a large class of integrands of interest. Several different 
variants of this basic algorithm have been developed; we use the VEGAS 
algorithm [|J. In VEGAS the grid points are used to subdivide the axes into 
a maximum of fifty bins. The bin boundaries may be used to break up the 
region of integration into a number of hypercubes. Ideally, the boundaries 
of the hypercubes are such that II integrated over each hypercube gives the 
same contribution to the definite integral of II over the region of interest. 
Smaller hypercubes would then correspond to regions where IT is large, larger 
hypercubes to regions where II is small. 

Quite apart from the definite integral, the grid information may also 
be used to define a PDF V which roughly mimics II. Sampling from V is 
straightforward; hypercubes are picked at random in such a way that the 
probability of picking any given hypercube is the same for all hypercubes, 
and a random number is used to locate a point X in the hypercube by 
uniform sampling. V is defined so that it is the same for all points in a 
given hypercube, and the value of V in a hypercube of volume Ay is 
More specifically, in one dimension a random number is used to pick a bin 
along the x axis in such a way that the probability of picking any bin is 
the same. Then a second random number is used to pick a point within 
the bin, all points within the bin sampled uniformly. AV is the bin width, 
so V for the point chosen is defined as the inverse of the width of the bin 
in which the point is located, independent of the precise point choosen in 
the bin. In two dimensions two random numbers are used to pick an area 
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element, and another two random numbers are used to pick a point in the 
area element. AV is now the area, so V at the point chosen is defined to 
be the inverse of the area element. In effect, we have sampled the function 
globally and have used VEGAS to adaptively construct a PDF V which is 
different from II which nonetheless mimics II. This procedure can obviously 
generalised to arbitrarily high dimensions. Regions where II is large (small) 
corespond the regions where AV is small (large) and hence to regions where 
V is large (small). 

Our strategy for sampling from II amounts to setting up a Metropolis- 
Hastings chain using V as a proposal distribution. From the discussion in 
the previous paragraph it is clear that regions where II are large are more 
likely to be selected than where II is small. A move X{ — > Xf is accepted 
(rejected) if 

x > rn(< rn) (1) 

V(X t ) U(X f ) 

where rn is a random number uniformly distributed between and 1. Essen- 
tially, we are using V as an independence sampler for IT. This method does 
preserve the condition of detailed balance and the stationary distribution of 
the resulting Markov Chain does indeed correspond to II [[|. Note that the 
fixed step size 5 plays no role whatsoever, rather 5 varies from move to move 
tuned to the seperation between the peaks of II. One potential objection to 
this scheme is that the function must be evaluated a large number of times by 
VEGAS before a random sample can be drawn from it and it is not obvious 
whether the number of function evaluations needed is less than would be re- 
quired in an approach with fixed step-size. This objection will be addressed 
in the example we consider. 

The first and simplest example we consider is a mixture of univariate 
gaussians defined in the interval [0, 22]. The precise function II is given by 

.5 {tf(x, 3, 1)} + .2 {Af(x, 14, .025)} + .3 {M(x, 19, .75)} (2) 

where N{x,x,a 2 ) denotes a uni-variate gaussian with mean x and variance 
a 2 . This function clearly has well-seperated multiple peaks; generating a 
sample from this PDF of this kind is thus liable to be problematic. 

The first step in our approach is to integrate II with VEGAS preserving 
the grid information generated by VEGAS. In this case the grid information 
is a set of 50 points in the interval [0,22]. The points define bins which are 
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such that the contribution to the definite integral from each bin is nearly 
equal. As expected, the bins are narrow (wide) where the integrand is large 
(small). II was evaluated 2500 times for this purpose and a grid reflecting 
the peaks in II was used to generate bins of varying widths. These bins were 
used to define V in the interval [0, 22] along the lines just described. V thus 
obtained has been plotted in Fig. 1; the correspondence between Fig. 1 and 
Eq. | is striking. 

The next step is to generate a sample from II using V as an independence 
sampler. The acceptance rate of the Metropolis-Hastings chain is remarkably 
high, about 80%; i.e. about 80 % of the moves were accepted using the crite- 
rion defined in Eq. [|. This is desirable from the point of view of minimising 
CPU time and reflects the accuracy with which V mimics the underlying dis- 
tribution II defined in Eq. [| In all, II was evaluated a total of 15,000 times 
to generate a sample. We have checked that the average value of the random 
variable as well as a number of higher moments are correctly reproduced, 
within statistical error bars. This implies not only that all peaks have been 
discovered but crucially, that the relative weights of all the peaks have also 
been correctly reproduced. By way of comparision, we have checked that 
running a Metropolis chain with the II evaluated over 100 000 times with 
fixed step size does not convincingly reproduce even the first two moments. 
The advantage of our approach is clear. 

We now go on to two dimensional examples. Here a complication arises; 
in dimensions larger than one the VEGAS algorithm implicitly assumes 
that V is factorisable; i.e. V may be accurately represented in the form 
V = Pi(xi)pj(xj) ■ ■ ■. For many functions of interest this is a reasonable ap- 
proximation, however if the function has a peak along a lower dimensional 
hypersurface other than a co-ordinate axis, this approximation may be a poor 
one. In particular, the VEGAS algorithm performs poorly if the function 
(assumed to be defined in a hypercube) has a peak along a diagonal of the 
hypercube. However, this does not mean that the distribution V generated 
from the VEGAS grid cannot be used to sample from n. All that happens 
is that the acceptance rate of the resulting Metropolis chain is lower. To 
illustrate this point, we consider a mixture of two bi-variate gaussians in a 
square whose means lie along a diagonal. The precise function is defined 
below. 

n = 0.7 {Q(x, y, 4, 4, 1, 1, .8)} + 0.3 {G(x, y, 12, 12, 1, 1, -.8)} (3) 
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where Q(x, y, fj, x , fiy, a x , a y , p) is defined by 



=^exp— - 

p 1 2(1 -p 2 ) 



(x-p x ) 2 (y-fj,y) 2 2p(x - (j, x )(y - Hy) 



+ 



On 



a x a v 



The region of integration is an (16 x 16) square with one corner at the ori- 
gin and sides along the positive x and y axes. This function is not well suited 
to evaluation by VEGAS as both peaks lie along a diagonal of the square, and 
this is reflected in the fact that the acceptance rate of the Metropolis-Hastings 
chain is only ~ 23%. However, the grid information does correctly reflect the 
location of both peaks and the values of < x n y m > where (m + n) < 6 are 
corectly reproduced, indicating not only that both peaks have been found, 
but also that the relative weights assigned to both peaks is correct. As a 
check, we have considered another function, 

n = 0.7 {G(x, y, 4,4, 1, 1, .8)} + 0.3 {G(x, y, 12,4, 1, 1, -.8)} 

which differs from the bi-variate gaussian in Eq. |3] in that both peaks now 
lie along a line parallel to the x axis. Once again, grid information is used 
to generate a sample from which correct moments can be recovered. This 
time though, due to the more favourable location of the peaks the accep- 
tance rate is almost twice as high as previously. We see again that an 
adaptive Monte Carlo approach can generate an independence sampler for a 
Metropolis-Hastings chain even when the target distribution n is two dimen- 
sional and has well seperated modes. It is worth pointing out that modifying 
n by the introduction of stepping stone distributions |5J has been suggested 
as a means to facilitate sampling PDFs of this nature; in our approach no 
such modifications are necessary. 

We conclude with a brief discussion of the relevance of our methods for 
event generation in experimental high energy physics simulations, where 
a sample from a potentially very complicated differential scattering cross- 
section dependent on more than two variables is required. If analytic inver- 
sion is not possible (as is often the case), another approach such as rejection 
sampling is needed. This however requires an enveloping distribution which 
must be somehow obtained, either by guesswork or possibly by using the VE- 
GAS grid information ||. Alternatively the grid information may be used 
to construct an importance sampler for a Metropolis-Hastings chain which 
can be used to generate events. To test this in practise, we have considered 
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the example of anomalous single t production in future 77 colliders, followed 
by t — > btv evaluated in the narrow width approximation for the t and W 
[0]. The five dimensional phase space has been integrated over with VEGAS 
and the resulting grid was used as an importance sampler to generate events 
along the lines of the previous examples. Neglecting the effects of cuts, smear- 
ing and hadronisation, we obtained an acceptance rate of about 75%, even 
though no attempt whatsoever was made to optimise the grid. In particular, 
our sampling did not make any use of simplifications resulting either from 
the use of the narrow width approximation or from the (V — A) structure 
of weak decays. This suggests that the methods we have outlined may be 
worthwhile incorporating into event generators for high energy physics, at 
least in instances when the phase space can be integrated over with VEGAS. 
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